Abstract

The problem of recovering the sparsity pattern of a fixed but unknown vector $\beta^* \in \mathbb{R}^p$
based on a set of n noisy observations arises in a variety of settings, including subset selection
in regression, graphical model selection, signal denoising, compressive sensing, and constructive
approximation. Of interest are conditions on the model dimension p, the sparsity index s
(number of non-zero entries in $\beta^*$), and the number of observations n that are necessary and/or
sufficient to ensure asymptotically perfect recovery of the sparsity pattern. This paper focuses
on the information-theoretic limits of sparsity recovery: in particular, for a noisy linear
observation model based on measurement vectors drawn from the standard Gaussian ensemble, we
derive both a set of sufficient conditions for asymptotically perfect recovery using the optimal
decoder, as well as a set of necessary conditions that any decoder, regardless of its computational
complexity, must satisfy for perfect recovery. This analysis of optimal decoding limits
complements our previous work [24] on sharp thresholds for sparsity recovery using the Lasso
($\ell_1$-constrained quadratic programming) with Gaussian measurement ensembles.

Keywords: High-dimensional statistical inference; subset selection; signal denoising; compressive 
sensing; model selection; sparsity recovery; information-theoretic bounds; Fano's method. 

1 Introduction 

Suppose that we are given a set of n observations of a fixed but unknown vector $\beta^* \in \mathbb{R}^p$. In
a variety of settings, it is known a priori that the vector $\beta^*$ is sparse, meaning that its support
set S (corresponding to those indices i for which $\beta^*_i$ is non-zero) is relatively small, say with
size $|S| =: s \ll p$. Sparsity recovery refers to the problem of correctly estimating the support set
S based on a set of noisy observations. This sparsity recovery problem is of broad interest, arising 
in various areas, including subset selection in regression [20], structure estimation in graphical 
models pjj] , sparse approximation ETJ , signal denoising [5] , and compressive sensing [U [3] . 

A great deal of work over the past few years has focused on the performance of computationally 
tractable methods, many based on $\ell_1$ or other convex relaxations, both for recovering the exact
sparsity pattern as well as related problems in sparse approximation. We provide a brief overview
of those parts of this extensive literature most relevant to our work in Section 1.1 below. Of equal
interest and complementary in nature, however, are the information-theoretic limits associated
with the performance of any procedure for sparsity recovery. Such understanding of fundamental 
limitations is crucial in assessing the behavior of computationally tractable methods. In partic- 
ular, there is little point in proposing novel methods for sparsity recovery, possibly with higher 
computational complexity, if currently extant and computationally tractable methods achieve the 
information-theoretic limits. On the other hand, an information-theoretic analysis can reveal where 
there currently exists a gap between the performance of computationally tractable methods, and the 
fundamental limits. Indeed, the information-theoretic analysis of this paper makes contributions 
of both types. 

With this motivation in mind, the focus of this paper is on the information-theoretic limitations 
of sparsity recovery. In particular, our analysis focuses on the noisy and high-dimensional setting, 
meaning that the observations are contaminated by noise, and all three problem parameters — the 
number of observations n, the model dimension p, and the sparsity index s, defined below — may 
tend to infinity. Our main results, stated more precisely in Section 1.2, are necessary and sufficient
conditions on the triplet (n,p,s) for exact recovery. In particular, given noisy linear observations 
based on measurement vectors drawn from the standard Gaussian ensemble, we derive both a set 
of sufficient conditions for asymptotically perfect recovery using the optimal decoder, as well as a 
set of necessary conditions that any decoder must satisfy for perfect recovery. The analysis given 
here complements our earlier paper [24] that established precise thresholds on the success/failure
of the Lasso (i.e., $\ell_1$-constrained quadratic programming) for sparsity recovery.

The remainder of this paper is organized as follows. In Section 1.1, we provide a more precise
formulation of the problem, and a brief discussion of past work, whereas Section 1.2 provides a
precise statement of our main results, and a discussion of their consequences. Section 2 and the
appendices are devoted to the proofs of our main results, and we conclude in Section 3 with a
discussion of open directions. 

1.1 Problem formulation and past work 

We begin with a more precise formulation of the problem, as well as a discussion of previous work, 
with emphasis on that most closely related to the results in this paper. Let $\beta^* \in \mathbb{R}^p$ be a fixed but
unknown vector; we refer to the ambient dimension p as the model dimension. Define the support
set of $\beta^*$ as

$$S := \{\, i \in \{1, \ldots, p\} \mid \beta^*_i \neq 0 \,\}. \qquad (1)$$

We refer to its size $s := |S|$ as the sparsity index. Finally, suppose that we are given a set of n
observations, of the form

$$Y_i = x_i^T \beta^* + W_i, \qquad i = 1, \ldots, n, \qquad (2)$$

where each $x_i \in \mathbb{R}^p$ is a measurement vector, and $W_i \sim N(0, \sigma^2)$ is additive Gaussian noise. Of
interest are conditions on the triplet (n,p,s) under which a given method either succeeds or fails
in recovering the sparsity pattern S.

Observation models: The linear observation model (2) can be studied in either its noiseless
variant ($\sigma^2 = 0$) or the noisy setting ($\sigma^2 > 0$); this paper focuses exclusively on the noisy setting.
In addition, previous work has addressed both deterministic families and random ensembles of
measurement vectors $\{x_i\}_{i=1}^n$. The analysis in this paper is based on the standard Gaussian
measurement ensemble, in which each measurement vector $x_i$ is drawn from the zero-mean isotropic
Gaussian distribution $N(0, I_{p \times p})$.
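As an illustration of the noisy linear observation model under the standard Gaussian ensemble, the following minimal sketch draws one synthetic data set. The function name `sample_observations`, the choice of a common value `beta_min` for every non-zero entry, and the seed handling are assumptions made for this example only.

```python
import random

def sample_observations(n, p, support, beta_min, sigma=1.0, seed=0):
    """Draw (X, Y, beta) from Y_i = x_i^T beta + W_i with i.i.d. N(0, 1) entries in X."""
    rng = random.Random(seed)
    beta = [0.0] * p
    for i in support:
        beta[i] = beta_min  # hypothetical choice: all non-zero entries share one value
    # Each row x_i is a draw from N(0, I_{p x p}).
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    # Noisy linear observations with noise standard deviation sigma.
    Y = [sum(x * b for x, b in zip(row, beta)) + rng.gauss(0.0, sigma) for row in X]
    return X, Y, beta

X, Y, beta = sample_observations(n=20, p=8, support=[1, 4], beta_min=0.5)
support_of_beta = [i for i, b in enumerate(beta) if b != 0.0]
print(support_of_beta)
```

Here the sparsity index is s = 2 and the model dimension is p = 8; the recovery problem is to estimate `support_of_beta` from `X` and `Y` alone.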

Error metrics: Consider some method that generates the vector $\hat\beta \in \mathbb{R}^p$ as an estimate of the
truth $\beta^*$. There are various distinct criteria for assessing how close the estimate is to the truth,
including

• various $\ell_p$ norms $\mathbb{E}\|\hat\beta - \beta^*\|_p$, especially $\ell_2$ and $\ell_1$, or

• some measurement of predictive power (e.g., $\mathbb{E}[\|\hat Y_i - Y_i\|^2]$, where $\hat Y_i$ is the estimate based on $\hat\beta$).



Given the abundance of recent results on sparse approximation (not all of which are mutually 
comparable), it is particularly important to specify up front the choice of error metric. In this 
paper, we focus exclusively on the sparsity recovery problem, for which the appropriate error 
metric is simply the 0-1 loss associated with the event of recovering the correct support S, viz.:



Past work: Closely related in its information-theoretic spirit is the earlier paper of Fletcher et
al. [H] that analyzed the standard Gaussian ensemble from a rate-distortion perspective, studying
the average $\ell_2$-error of the optimal decoder. The results given here also address the
information-theoretic limitations, albeit of the sparsity recovery problem, using the error metric (3) as opposed
to the $\ell_2$-norm. In a related but distinct line of work, the use of $\ell_1$-relaxation for sparse approximation
has a lengthy history; relatively early papers from the 1990s include the work of Chen, Donoho and
Saunders [5], as well as Tibshirani [22] on $\ell_1$-constrained quadratic programming (known as the
Lasso in the statistics literature). A great deal of subsequent work has analyzed the performance of
$\ell_1$-relaxations, both in the noiseless p21 [131 HB] and noisy setting [23] for deterministic ensembles,
as well as the noiseless [lOj O [11] and noisy setting [USE [TU [271 121] for random ensembles.
Other work has provided conditions under which estimation of a noise-contaminated vector via the
Lasso [21 [9] or other types of convex relaxation [3] is stable in the $\ell_2$ sense; however, such $\ell_2$-stability
does not guarantee exact recovery of the underlying sparsity pattern.

A notable feature of the results given here is that they apply to completely general scaling 
of the triplet (n,p,s). In contrast, most previous work has addressed one of two possible special 
cases of sparsity scaling: (a) either the linear sparsity regime [e.g O \10\ [9], in which $s = \alpha p$
for some $\alpha \in (0,1)$; or (b) the sublinear sparsity regime [e.g., [T9], [27], in which s/p tends to
zero. Depending on the underlying motivation for sparse approximation, both of these sparsity 
regimes are of independent interest. In covering the full range of scaling, the results given here are 
complementary to those of our previous paper [21] that provided threshold results, also applicable 
to general scaling of (n,p,s), for the success/failure of the Lasso when used for sparsity recovery 
with random Gaussian measurement ensembles. We discuss connections to previous work in more 
technical detail following the statement of our main results below. 
1.2 Our contributions

The analysis of this paper is asymptotic in nature, focusing on scaling conditions on
the triplet (n,p,s) under which asymptotically exact recovery is either possible or impossible. As
mentioned previously, we focus on the linear observation model (2) in the noisy setting ($\sigma^2 > 0$), and
with the measurement vectors $x_i$ drawn in an i.i.d. manner from the standard Gaussian $N(0, I_{p \times p})$
ensemble. A decoder is a mapping $\phi$ from the n-vector of observations Y to an estimated subset, say
of the form $\hat S = \phi(Y)$. We think of the underlying true vector $\beta^* \in \mathbb{R}^p$ with its support S randomly
chosen, uniformly over all $\binom{p}{s}$ subsets of size s. Accordingly, the average error probability $p_{\mathrm{err}}(\phi)$
of any decoder is given by

$$p_{\mathrm{err}}(\phi) = \frac{1}{\binom{p}{s}} \sum_{S,\, |S| = s} \mathbb{P}[\phi(Y) \neq S \mid S].$$

Here the term $\mathbb{P}[\phi(Y) \neq S \mid S]$ corresponds to the probability, conditioned on the true underlying
support being S and averaging over the measurement noise W, the choice of Gaussian random
matrix X, and the choice of the entries $\beta^*_S$ on the fixed support S, that the decoder makes an error.
We say that

• the sparsity recovery is asymptotically reliable (error-free) if $p_{\mathrm{err}}(\phi) \to 0$ as $n \to +\infty$, and

• the sparsity recovery is asymptotically unreliable if for some constant c > 0, the error
probability stays bounded away from zero, $p_{\mathrm{err}}(\phi) \ge c$, as $n \to +\infty$.

In addition to the three parameters (n,p,s), our results also involve the minimum value of the
unknown vector $\beta^*$ on its support, given by

$$\mathcal{M}(\beta^*) := \min_{i \in S} |\beta^*_i|. \qquad (4)$$

We begin by stating a set of conditions on the triplet (n,p,s) which are sufficient to ensure
asymptotically perfect recovery of the sparsity pattern:

Theorem 1 (Sufficient conditions). If $(n - s)\mathcal{M}^2(\beta^*) \to +\infty$, then the following condition suffices
to ensure asymptotically reliable recovery: for some fixed constant C > 0,

$$n > C \max\left\{ s \log(p/s),\; \frac{1}{\mathcal{M}^2(\beta^*)} \log(p - s) \right\}. \qquad (5)$$

The proof of this claim, given in Section 2.2, is constructive in nature, based on direct analysis of
the error probability associated with the optimal decoder. 

Theorem 2 (Necessary conditions). Asymptotically reliable recovery is impossible under the
following condition: for some fixed constant C > 0,

$$n < C\, \frac{1}{s\,\mathcal{M}^2(\beta^*)}\; s \log\frac{p}{s}. \qquad (6)$$

The proof of this claim, given in Section 2.3, is somewhat more indirect in nature, based on
exploiting a corollary of Fano's inequality [6, 15, 16, 26] in order to lower bound the probability
of error for a restricted hypothesis testing problem. To interpret these results, we consider two 
distinct regimes of sparsity: 
Regime of sublinear sparsity: First suppose that the sparsity is sublinear, meaning that
s = o(p). Based on the two theorems, we identify the critical scaling as $\mathcal{M}^2(\beta^*) = \Theta(1/s)$. With
this scaling, the sufficient condition in Theorem 1 reduces to $n > C s \max\{\log(p - s), \log\frac{p}{s}\}$,
whereas the necessary condition in Theorem 2 reduces to $n < C s \log\frac{p}{s}$. For many choices of
sublinear sparsity (e.g., $s = O(\sqrt{p})$), we have $\log\frac{p}{s} = \Omega(\log(p - s)) - o(1)$, so that we can summarize
the two conditions as a threshold of the order $n = \Theta(s \log(p - s))$. By comparison, our previous
work [23] on computationally tractable methods established that $\ell_1$-constrained quadratic
programming (Lasso) has a threshold$^1$ for success/failure of order $n = \Theta(s \log(p - s))$, so that the
Lasso essentially achieves the information-theoretic bounds.

Regime of linear sparsity: Next consider the regime of linear sparsity, in which $s = \alpha p$ for
some $\alpha \in (0, 1)$. Considering first the sufficient conditions of Theorem 1, we see that as long as
$\mathcal{M}^2(\beta^*)\, s \to +\infty$, then $n = \Theta(p)$ observations are sufficient to ensure asymptotically reliable
recovery. This information-theoretic condition should be compared with our earlier analysis [23]
of $\ell_1$-constrained quadratic programming (the Lasso); one consequence of this work is that if
$n < 2 s \log(p - s)$, then the Lasso fails with probability converging to one, even if $\mathcal{M}^2(\beta^*)$ stays
bounded away from zero. Given that $2 s \log(p - s) \gg \Theta(p)$ for linear sparsity $s = \alpha p$, we see
that there is a substantial gap between the performance of the Lasso and the optimal decoder in
the linear sparsity regime. Thus, Theorem 1 raises the interesting question as to the existence
of computationally efficient techniques for asymptotically reliable recovery in the regime of linear
sparsity.
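To make the gap in the linear sparsity regime concrete, the following sketch evaluates the right-hand side of the sufficient condition of Theorem 1, max{s log(p/s), log(p - s)/M^2}, next to the Lasso scaling 2 s log(p - s), both divided by p. The constant C = 1, the choice alpha = 0.1, and the value M^2 = 1 are assumptions for illustration; the theorem only asserts that some constant C exists.

```python
import math

def sufficient_n(p, s, m2, C=1.0):
    """C * max{ s log(p/s), log(p - s) / M^2 }, with a hypothetical constant C."""
    return C * max(s * math.log(p / s), math.log(p - s) / m2)

def lasso_threshold(p, s):
    """Order of the Lasso failure threshold quoted in the text: 2 s log(p - s)."""
    return 2 * s * math.log(p - s)

alpha = 0.1   # assumed sparsity fraction, s = alpha * p
m2 = 1.0      # assumed minimum value M^2, bounded away from zero
table = []
for p in (10**3, 10**4, 10**5):
    s = int(alpha * p)
    table.append((p, sufficient_n(p, s, m2) / p, lasso_threshold(p, s) / p))

for p, opt_ratio, lasso_ratio in table:
    print(p, round(opt_ratio, 4), round(lasso_ratio, 4))
```

Under these assumptions the first ratio stays constant in p (linear growth of n suffices for the optimal decoder), while the second grows like log p.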

2 Analysis 

This section is devoted to the proofs of Theorems 1 and 2. We begin by setting up some useful
notation to be used throughout the remainder of the paper. 

2.1 Notation and set-up 

For compactness in notation, let us use X to denote the n x p matrix formed with the vectors
$x_k = (x_{k1}, x_{k2}, \ldots, x_{kp}) \in \mathbb{R}^p$ as rows, and the vectors $X_j = (x_{1j}, x_{2j}, \ldots, x_{nj})^T \in \mathbb{R}^n$ as columns,
as follows:

$$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix} = \begin{bmatrix} X_1 & X_2 & \cdots & X_p \end{bmatrix}. \qquad (7)$$

Using Y and W to denote the n-dimensional observation and noise vectors respectively, we can
rewrite our linear observation model (2) in matrix-vector form as follows:

$$Y = X\beta^* + W. \qquad (8)$$

1 Those results [24] allowed the minimum value to scale as $\mathcal{M}^2(\beta^*) = f(s)/s$, where f is any function such that
$\lim_{s \to +\infty} f(s) = +\infty$.
Given any subset $V \subseteq \{1, \ldots, p\}$, we use the notation $\beta_V$ to denote the |V|-dimensional subvector
$\{\beta_i, i \in V\}$, and similarly for other vectors (e.g., Y, etc.). In an analogous manner, we use $X_V$
to denote the n x |V| matrix with columns $\{X_i, i \in V\}$. From here onwards, we assume without loss of
generality that $\sigma^2 = 1$, so that $W \sim N(0, I_{n \times n})$ is simply a standard Gaussian vector. (Note that
any scaling of $\sigma$ can be accounted for in the scaling of $\beta^*$, via the parameter $\mathcal{M}(\beta^*)$.)

In addition, we use the following standard notation for asymptotics of real sequences $\{a_n\}$ and
$\{b_n\}$: (i) $a_n = O(b_n)$ means that $a_n \le C b_n$ for some constant $C \in (0, \infty)$; (ii) $a_n = \Omega(b_n)$ means
that $a_n \ge C' b_n$ for some constant $C' \in (0, \infty)$; (iii) $a_n = \Theta(b_n)$ is shorthand for $a_n = O(b_n)$ and
$a_n = \Omega(b_n)$; and (iv) $a_n = o(b_n)$ means that $a_n / b_n \to 0$.

2.2 Proof of Theorem 1

Optimal decoding: We begin by describing the "best" decoder, which is optimal in terms of
minimizing the probability of error $p_{\mathrm{err}}(\phi)$ over all decoding rules. It is based on the following
real-valued function, defined on the subsets $U \subseteq \{1, \ldots, p\}$, as

$$f(U; Y, X, \beta^*) := \min_{\beta_U} \|Y - X_U \beta_U\|_2^2. \qquad (9)$$

We frequently write f(U) as a shorthand; note that this value corresponds to the error associated
with the best estimator of Y that lies in the range space $\mathrm{Ra}(X_U)$. The optimal decoder chooses the best subset $\hat S$
based on the minimal value of this error, ranging over all subsets U of size s:

$$\hat S = \phi_{\mathrm{opt}}(Y) := \arg\min_{|U| = s} f(U; Y, X, \beta^*). \qquad (10)$$

Note that by symmetry, the error probability $\mathbb{P}[\hat S \neq S \mid S]$ is in fact the same regardless of which
underlying set S acts as the true one. Consequently, we can view the choice of S as fixed (and
hence non-random), and write

$$p_{\mathrm{err}}(\phi) = \mathbb{P}[\phi(Y) \neq S], \qquad (11)$$

which should now be understood as an unconditional probability (with S fixed).
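The exhaustive-search decoder described above can be sketched for a toy problem as follows. This is a minimal pure-Python illustration, not an efficient implementation: the helper names `solve`, `residual` and `optimal_decoder`, the problem sizes, and the coefficient value 2.0 are assumptions for the example, and the search cost grows as the number of size-s subsets.

```python
import itertools
import random

def solve(G, c):
    """Solve G b = c by Gaussian elimination with partial pivoting (G square, full rank)."""
    m = len(c)
    A = [row[:] + [c[i]] for i, row in enumerate(G)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for j in range(col, m + 1):
                A[r][j] -= f * A[col][j]
    b = [0.0] * m
    for r in reversed(range(m)):
        b[r] = (A[r][m] - sum(A[r][j] * b[j] for j in range(r + 1, m))) / A[r][r]
    return b

def residual(X, Y, U):
    """f(U): squared error of the least-squares fit of Y on the columns of X indexed by U."""
    cols = [[row[j] for row in X] for j in U]
    G = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    c = [sum(a * y for a, y in zip(ci, Y)) for ci in cols]
    b = solve(G, c)  # normal equations: (X_U^T X_U) b = X_U^T Y
    return sum(y * y for y in Y) - sum(ci * bi for ci, bi in zip(c, b))

def optimal_decoder(X, Y, s):
    """Return the size-s subset with the smallest residual (exhaustive search)."""
    p = len(X[0])
    return min(itertools.combinations(range(p), s), key=lambda U: residual(X, Y, U))

rng = random.Random(1)
n, p, S = 30, 6, (1, 4)   # toy sizes, chosen for illustration
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
Y = [2.0 * row[1] + 2.0 * row[4] + rng.gauss(0, 1) for row in X]
S_hat = optimal_decoder(X, Y, s=2)
print(S_hat)
```

The residual uses the identity ||Y - X_U b||^2 = ||Y||^2 - (X_U^T Y)^T b at the least-squares solution b, which avoids forming the projection matrix explicitly.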



Analysis of error probability: Consider the difference $\Delta(U) := f(U) - f(S)$ between the
reconstruction error f(S) using the true subset S, versus the error f(U) using the candidate subset U. For
any subset U such that $X_U$ is full rank, define the n x n matrices

$$\Pi_U := X_U [X_U^T X_U]^{-1} X_U^T \qquad \text{and} \qquad (12a)$$
$$\Pi_U^\perp := I_{n \times n} - X_U [X_U^T X_U]^{-1} X_U^T. \qquad (12b)$$

Note that $\Pi_U$ and $\Pi_U^\perp$ are both orthogonal projection matrices, associated with the s-dimensional
range space $\mathrm{Ra}(X_U)$ and the (n-s)-dimensional nullspace $\mathrm{Ker}(X_U^T)$ respectively. With these definitions,
we state the following result (see Appendix A for a proof):

Lemma 1. For a given vector $\beta^*$ with support S, the optimal decoder prefers U over S if and
only if the random variable

$$\Delta(U) := \|\Pi_U^\perp (X_{S \setminus U} \beta^*_{S \setminus U} + W)\|_2^2 - \|\Pi_S^\perp W\|_2^2 \qquad (13)$$

is negative.
Overall, the optimal decoder fails if and only if at least one U (with cardinality |U| = s) is preferable
to S; consequently, the probability of error can be written as

$$p_{\mathrm{err}}(\phi_{\mathrm{opt}}) = \mathbb{P}\Big[ \bigcup_{U \neq S,\, |U| = s} \{\Delta(U) < 0\} \Big]. \qquad (14)$$

In order to analyze this error probability, we begin by considering the range of possible integers
$k := |S \setminus U|$, corresponding to the complement of the overlap. The following lemma characterizes
the exponential decay rates of the random variable $\Delta(U)$:

Lemma 2. For fixed k (with $1 \le k \le s$), we have for any U with $|S \setminus U| = k$,

-(n-s)\\P*^ 2 



F[A(U) < 0] < exp 



12 



J s\u\ 







| + 2 exp < 





1, , WPs\u\ 

1 + A {n - s) ^r 



(15) 



Proof. We begin by conditioning on the Gaussian noise vector W. Since each element of $X_{S \setminus U}$ is
standard normal, each entry of the random vector $X_{S \setminus U} \beta^*_{S \setminus U}$ is zero-mean Gaussian with variance
$\|\beta^*_{S \setminus U}\|_2^2$. Consequently, if we rescale by the standard deviation, then the random vector

$$\frac{1}{\|\beta^*_{S \setminus U}\|_2} \big( X_{S \setminus U} \beta^*_{S \setminus U} + W \big)$$

is an n-dimensional Gaussian random vector with independent entries, each with unit variance,
and mean vector $W / \|\beta^*_{S \setminus U}\|_2$. Applying the orthogonal projection $\Pi_U^\perp$ reduces the number of degrees of
freedom to (n - s), so that we conclude that

$$\frac{1}{\|\beta^*_{S \setminus U}\|_2^2} \big\| \Pi_U^\perp \big( X_{S \setminus U} \beta^*_{S \setminus U} + W \big) \big\|_2^2$$

is a non-central $\chi^2$ variate with d = n - s degrees of freedom, and non-centrality parameter
$\nu = \|\Pi_U^\perp W\|_2^2 / \|\beta^*_{S \setminus U}\|_2^2$. With these choices of $(d, \nu)$, we have

$$\mathbb{P}[\Delta(U) < 0 \mid W] = \mathbb{P}[\chi^2(d, \nu) < t],$$

where we have set $t := \|\Pi_S^\perp W\|_2^2 / \|\beta^*_{S \setminus U}\|_2^2$ for shorthand. Thus, conditioned on W, our problem reduces to
bounding the tail of a non-central $\chi^2$ variate. In Appendix D, we state some known tail bounds pQ
on such variates, which we use here. In order to apply these bounds, we condition on the following
"good event", defined in terms of W:



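$$\mathcal{A} := \left\{ \frac{\big| \|\Pi_U^\perp W\|_2^2 - \|\Pi_S^\perp W\|_2^2 \big|}{\|\beta^*_{S \setminus U}\|_2^2} \le \frac{n - s}{2} \right\} \cap \left\{ \|\Pi_U^\perp W\|_2^2 \le 2(n - s) \right\}.$$

Note that the first event defining $\mathcal{A}$ ensures that

$$d + \nu - t = (n - s) + \frac{1}{\|\beta^*_{S \setminus U}\|_2^2} \big( \|\Pi_U^\perp W\|_2^2 - \|\Pi_S^\perp W\|_2^2 \big) \ge \frac{n - s}{2} > 0. \qquad (16)$$

Consequently, conditioned on $\mathcal{A}$, we may set $x := \frac{(d + \nu - t)^2}{4(d + 2\nu)}$ in equation (35b) to obtain the upper
bound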



logP[A(£/) < | A] < 



(d + u- ty 

4(d + 2v) 



4([n- S ] + 2^ 



(n — s) 



< — (n — s) 



j ||n^n/|| 2 -||n^ vK|| 2 



(n-s) pf^ll 



4fl + 2 , 
1/2 



(n-s) ||/3J W I! 2 



(n — s) 



l + 4/||^ w P 



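(17)

where inequality (b) makes use of the second event defining $\mathcal{A}$.
We complete the proof by observing that

$$\mathbb{P}[\Delta(U) < 0] \le \mathbb{P}[\Delta(U) < 0 \mid \mathcal{A}] + \mathbb{P}[\mathcal{A}^c], \qquad (18)$$

so that it suffices to upper bound $\mathbb{P}[\mathcal{A}^c]$. By union bound, we have

$$\mathbb{P}[\mathcal{A}^c] \le \mathbb{P}\left[ \frac{\big| \|\Pi_U^\perp W\|_2^2 - \|\Pi_S^\perp W\|_2^2 \big|}{\|\beta^*_{S \setminus U}\|_2^2} > \frac{n - s}{2} \right] + \mathbb{P}\big[ \|\Pi_U^\perp W\|_2^2 > 2(n - s) \big]. \qquad (19)$$

Since $\|\Pi_U^\perp W\|_2^2$ is a central $\chi^2$ with (n - s) degrees of freedom, we may apply the tail bounds from
Appendix D to conclude that

$$\mathbb{P}\big[ \|\Pi_U^\perp W\|_2^2 > 2(n - s) \big] \le \exp(-(n - s)/12). \qquad (20)$$

Turning to the first term on the RHS of equation (19), we observe that

$$\|\Pi_U^\perp W\|_2^2 - \|\Pi_S^\perp W\|_2^2 = \|\Pi_S W\|_2^2 - \|\Pi_U W\|_2^2 \stackrel{d}{=} \sum_{i \in U \setminus S} Z_i^2 - \sum_{j \in S \setminus U} Z_j^2,$$

where $\{Z_i, Z_j\}$ are i.i.d. standard normal variates. Now if the difference $\sum_{i \in U \setminus S} Z_i^2 - \sum_{j \in S \setminus U} Z_j^2$ is
to exceed $\frac{1}{2}(n - s)\|\beta^*_{S \setminus U}\|_2^2$ in absolute value, then at least one of the terms must exceed $\frac{1}{4}(n - s)\|\beta^*_{S \setminus U}\|_2^2$. Moreover,
we observe that $\sum_{j \in S \setminus U} Z_j^2 \sim \chi^2_k$, where $k = |S \setminus U|$. Hence, we have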



logP 



|n^|| 2 - ||n^iy|| 2 



5* 112 

7 s\uW 



> 



n — s 



< log2F 

= log 2P 

k 

< — 



^T>-M-s) - 

k 4 k 



'* II 2 



1 



X%-k> kl-l + -(n- s) 



o* 1 1 2 



* 112 



n 1, s n^sw 



+ log2, 



where we have used the upper bound (|34ap from Appendix[D]with x :- 
in the final inequality. 



-1 + Hn-s) 



^s\u" 



□ 



Weakened but simpler bound: In order to make further progress, we simplify the bound (15)
from Lemma 2, at the expense of weakening it, by noting that for all $k \ge 1$, we have
$\|\beta^*_{S \setminus U}\|_2^2 \ge k \mathcal{M}^2(\beta^*)$, so that



. -(n-s)fc.M 2 (/3*)l fc 



n — s 



(21) 



The advantage of this weakened bound is that it is independent of the subset U, and depends only
on the parameter $k = |S \setminus U|$.

From this weakened bound (21), we see the necessity (at least for this analysis) of the
requirement $(n - s)\mathcal{M}^2(\beta^*) \to +\infty$, so that the second error term decays asymptotically. Under this
requirement, we have (for sufficiently large n) that the second error exponent can be bounded as



M 2 {[3* 



1 



< 



< 



< 



M 2 ((3* 



k 

~12 

-(n- s)kM 2 (P*) 
12 (kM 2 ((3*) + 8)' 



1 



The first error exponent is also upper bounded by this same quantity, so that we can simplify the
upper bound to

$$\mathbb{P}[\Delta(U) < 0] \le 3 \exp\left\{ \frac{-(n - s)\, k \mathcal{M}^2(\beta^*)}{12\,(k \mathcal{M}^2(\beta^*) + 8)} \right\}. \qquad (22)$$

Denote by N(k) the number of subsets U of size s with $|S \setminus U|$ exactly equal to k. A standard
counting argument yields that, for each k with $1 \le k \le s$, there are

$$N(k) = \binom{s}{k} \binom{p - s}{k} \qquad (23)$$
such subsets. Using the simple bound (22) and the union bound applied to the representation (14),
we can upper bound the error probability as

$$p_{\mathrm{err}}(\phi_{\mathrm{opt}}) \le 3 \sum_{k=1}^{s} \binom{s}{k} \binom{p - s}{k} \exp\left\{ \frac{-(n - s)\, k \mathcal{M}^2(\beta^*)}{12\,(k \mathcal{M}^2(\beta^*) + 8)} \right\}. \qquad (24)$$

Analysis of the upper bound: We now analyze the upper bound (24); in particular, our goal
is to derive sufficient conditions for each of the terms in the summation to vanish asymptotically.
In order to deal with the binomial coefficients, we make use of the bounds (see Appendix C)

$$\log \binom{s}{k} \le k \log \frac{s e}{k} \qquad \text{and} \qquad \log \binom{p - s}{k} \le k \log \frac{(p - s) e}{k}. \qquad (25)$$
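The counting argument and the crude binomial bounds can be checked numerically on a small instance. The sketch below enumerates all size-s subsets U for assumed toy values p = 9 and s = 4, tallies them by k = |S minus U|, and compares the tallies with the product of binomial coefficients C(s, k) C(p - s, k); the helper names are hypothetical.

```python
import itertools
import math

def N(k, p, s):
    """Number of size-s subsets U with exactly k elements of S missing from U."""
    return math.comb(s, k) * math.comb(p - s, k)

p, s = 9, 4               # toy sizes, chosen for illustration
S = set(range(s))         # by symmetry, the true support can be taken as {0, ..., s-1}
brute = [0] * (s + 1)
for U in itertools.combinations(range(p), s):
    brute[len(S - set(U))] += 1
closed = [N(k, p, s) for k in range(s + 1)]

def crude_bounds_hold(n, k):
    """Check (n/k)^k <= C(n, k) <= (n e / k)^k for 1 <= k <= n."""
    c = math.comb(n, k)
    return (n / k) ** k <= c <= (n * math.e / k) ** k

print(brute, closed)
```

Summing the counts over k = 0, ..., s recovers the total number C(p, s) of candidate subsets, which is the Vandermonde identity.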

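Applying these two bounds, we conclude that the (logarithm of the) k-th term is upper bounded by

$$k \left[ 2 + \log\frac{s}{k} + \log\frac{p - s}{k} \right] - \frac{(n - s)\, k \mathcal{M}^2(\beta^*)}{12\,(k \mathcal{M}^2(\beta^*) + 8)}.$$

Requiring this term to be negative asymptotically is equivalent to having

$$(n - s) > \frac{12\,(k \mathcal{M}^2(\beta^*) + 8)}{k \mathcal{M}^2(\beta^*)}\; k \left[ 2 + \log\frac{s}{k} + \log\frac{p - s}{k} \right] = 12 \left( k + \frac{8}{\mathcal{M}^2(\beta^*)} \right) \left\{ 2 + \log\frac{s}{k} + \log\frac{p - s}{k} \right\}. \qquad (26)$$

In order to understand the behavior of this lower bound, we consider k in two distinct regimes: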



• On one hand, if $k = \gamma s$ for some $\gamma \in (0, 1)$, then the second term on the RHS of the bound (26)
is dominated by the term $\log\frac{p - s}{k} = \Theta(\log\frac{p}{s})$, so that the overall lower bound is dominated
by $\max\{s, \mathcal{M}^{-2}(\beta^*)\} \log(p/s)$.

• On the other hand, if k = o(s), the lower bound is dominated by the maximum of linear
growth s, and the quantity $\mathcal{M}^{-2}(\beta^*) \log(p - s)$.

Overall, we conclude that the condition

$$n > C \max\left\{ s \log(p/s),\; \frac{1}{\mathcal{M}^2(\beta^*)} \log(p - s) \right\}, \qquad (27)$$

for some constant C > 0, is sufficient in order to achieve asymptotically reliable recovery, as claimed
in Theorem 1.
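The per-k requirement used in this argument, namely n - s > 12 (k + 8/M^2) {2 + log(s/k) + log((p - s)/k)}, can be evaluated directly. The sketch below is an illustration under assumed values of (p, s, M^2); the function names are hypothetical, and the additive log 3 from the prefactor of the union bound is ignored, as in the text.

```python
import math

def rhs_26(k, p, s, m2):
    """12 (k + 8 / M^2) {2 + log(s/k) + log((p - s)/k)}, the per-k lower bound on n - s."""
    return 12.0 * (k + 8.0 / m2) * (2.0 + math.log(s / k) + math.log((p - s) / k))

def samples_needed(p, s, m2):
    """s plus the largest per-k requirement over k = 1, ..., s."""
    return s + max(rhs_26(k, p, s, m2) for k in range(1, s + 1))

p, s = 1000, 10   # assumed toy values
for m2 in (1.0, 0.5, 0.1):
    print(m2, round(samples_needed(p, s, m2), 1))
```

Decreasing the minimum value M^2 increases the requirement through the 8/M^2 term, consistent with the log(p - s)/M^2 term in the final condition.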

2.3 Proof of Theorem 2

We now turn to the proof of the necessary conditions given in Theorem 2.
Fano method: Our analysis is based on a well-known lower bound on the probability of error
in a multiway hypothesis testing problem in terms of Kullback-Leibler divergences. In the
non-parametric statistics literature [15, 16, 26], this approach is referred to as the Fano method, since
the bound is a corollary of Fano's inequality from information theory [6]. Here we state and make
use of the following variant [25]:

Lemma 3. Consider a family of N distributions $\{\mathbb{P}_1, \ldots, \mathbb{P}_N\}$. Then the average probability of
error in performing a hypothesis test over this family is lower bounded as

$$p_{\mathrm{err}} \ge 1 - \frac{\frac{1}{N^2} \sum_{i,j=1}^{N} D(\mathbb{P}_i \,\|\, \mathbb{P}_j) + \log 2}{\log(N - 1)},$$

where $D(\mathbb{P}_i \,\|\, \mathbb{P}_j)$ denotes the Kullback-Leibler divergence between distributions $\mathbb{P}_i$ and $\mathbb{P}_j$.
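As a small numerical illustration of this style of bound, the sketch below evaluates 1 - (average pairwise KL divergence + log 2) / log(N - 1) for a given N x N matrix of divergences. The function name `fano_lower_bound` and the toy family of unit-variance Gaussians on the real line (for which the divergence is half the squared mean difference) are assumptions for the example.

```python
import math

def fano_lower_bound(kl):
    """1 - (average pairwise KL + log 2) / log(N - 1) for an N x N divergence matrix, N >= 3."""
    N = len(kl)
    avg = sum(sum(row) for row in kl) / N**2
    return 1.0 - (avg + math.log(2)) / math.log(N - 1)

# Hypothetical family: N(mu_i, 1) with closely spaced means, so KL(i, j) = (mu_i - mu_j)^2 / 2.
means = [0.1 * i for i in range(5)]
kl = [[0.5 * (a - b) ** 2 for b in means] for a in means]
print(fano_lower_bound(kl))
```

When the hypotheses are hard to distinguish (small divergences) and N is large, the bound approaches one; the bound is vacuous when it is negative.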

Restricted problem: Consider the collection of all $N = \binom{p}{s}$ subsets of size s chosen from
$\{1, \ldots, p\}$. In order to produce lower bounds, we analyze the behavior of the optimal decoder for
a restricted problem, in which we assume that for any fixed support S, it is known a priori that
$\beta^*_i = \mathcal{M}(\beta^*)$ for all indices $i \in S$. (Recall that $\mathcal{M}(\beta^*)$ is the minimum absolute value of entries
in the support of $\beta^*$.) This problem is simply an N-way hypothesis testing problem, in which the
observation under the hypothesis associated with subset U takes the form

$$Y = X_U v + W, \qquad (28)$$

where $v = \mathcal{M}(\beta^*) \mathbf{1}_s$ is a rescaled s-vector of ones, and $W \sim N(0, I_{n \times n})$.

Let us index the collection of all s-sized subsets with i = 1, 2, ..., N, and use U[i] to denote the
corresponding support. For each index i, let $\mathbb{P}_i$ denote the multivariate Gaussian distribution with
mean $X_{U[i]} v$ and covariance matrix $I_{n \times n}$; note that $\mathbb{P}_i$ is simply the class-conditional distribution
of Y under the hypothesis U[i]. Moreover, the Kullback-Leibler divergence between any such pair
is given by $D(\mathbb{P}_i \,\|\, \mathbb{P}_j) = \frac{1}{2} \|X_{U[i]} v - X_{U[j]} v\|_2^2$, so that the corresponding Fano bound takes the form

$$p_{\mathrm{err}} \ge 1 - \frac{\frac{1}{N^2} \sum_{i,j=1}^{N} \|X_{U[i]} v - X_{U[j]} v\|_2^2 + 2 \log 2}{2 \log[N - 1]}.$$



Upper bounds via concentration: Thus, in order to ensure that $p_{\mathrm{err}}$ stays bounded away from
zero, we need to (upper) bound the quantity $\frac{1}{N^2} \sum_{i,j=1}^{N} \|X_{U[i]} v - X_{U[j]} v\|_2^2 / \log[N - 1]$ away
from one. For a given pair of subsets (U, V) in our collection, consider the random variable
$Z_{U,V} := \|X_U v - X_V v\|_2^2$. A little calculation shows that $Z_{U,V} \sim \gamma(U, V)\, \chi^2_n$, where

$$\gamma(U, V) = 2 \mathcal{M}^2(\beta^*)\, (s - |U \cap V|). \qquad (29)$$

The following result bounds the upper tail behavior of the random variable $Z = \frac{1}{N^2} \sum_{U \neq V} Z_{U,V}$.

Lemma 4. The tail of Z obeys the bound

$$\mathbb{P}\big[ Z \ge 4 \mathcal{M}^2(\beta^*)\, s\, n \big] \le \frac{1}{2}.$$

Using this lemma (see Appendix B for a proof of this claim), we are guaranteed that at least 1/2
of the Gaussian ensembles satisfy the upper bound



N 



1 



2 log[JV - 1] ~ log[7V - 1] ' 



(30) 



2 



log[iV - 1] 



Hence, as long as the quantity (30) remains bounded from above away from one, the Fano bound
implies that the probability of error averaged over the whole ensemble will remain bounded away
from zero. Consequently, we obtain the necessary condition that

$$n > \frac{\log[N - 1]}{4 \mathcal{M}^2(\beta^*)\, s}$$

for reliable recovery with probability one asymptotically. To obtain a more transparent bound, we
use the lower bound $\log \binom{p}{s} \ge s \log\frac{p}{s}$, as stated in Appendix C. Consequently, we obtain the necessary condition

$$n > C\, \frac{1}{s\,\mathcal{M}^2(\beta^*)}\; s \log\frac{p}{s}, \qquad (31)$$

as stated in Theorem 2.

3 Conclusion

In this paper, we have analyzed the information-theoretic limits of the sparsity recovery problem
for the linear observation model (2) with measurement vectors drawn from the standard Gaussian
ensemble. We have established both lower and upper bounds on the number of observations n as a
function of the model dimension p and sparsity index s that are required for asymptotically reliable
recovery.

There are a variety of open questions raised by our analysis. First, while our upper and lower
bounds are essentially matching for certain regimes of scaling (e.g., sublinear sparsity with the
minimum $\mathcal{M}^2(\beta^*) = \Theta(1/s)$), it is likely that the analysis can be tightened in other regimes. In
particular, the analysis of the necessary conditions (see proof of Theorem 2) involves some slack since
it is based on analyzing a very restricted ensemble. Second, our results (in particular, a corollary
of Theorem 1) reveal that with the sparsity index scaling linearly ($s = \alpha p$ for some $\alpha \in (0,1)$),
as long as the minimum value $\mathcal{M}^2(\beta^*)$ decays sufficiently slowly, asymptotically reliable
recovery is possible with only a linear number of observations (i.e., $n = \beta p$ for some $\beta > 0$). Since
our previous work [23] established that the Lasso ($\ell_1$-constrained quadratic programming) cannot
achieve reliable recovery in this particular (n,p,s) regime, it remains to determine a computationally
tractable method that approaches such performance in the regime of linear sparsity. Third, whereas
the current analysis has focused on a very special class of Gaussian ensemble, the analysis given
here could be extended to a broader class of measurement ensembles.
A Proof of Lemma 1

We begin by showing that for any subset U for which $X_U$ is full rank, the function f has the
equivalent form $f(U) = \|\Pi_U^\perp Y\|_2^2$. Under the given rank condition, the linear least squares estimator
of $\beta_U$ is given by $\hat\beta_U = [X_U^T X_U]^{-1} X_U^T Y$. Noting that $X_U \hat\beta_U = \Pi_U Y$, we substitute into the
quadratic norm and expand, thereby obtaining

$$f(U) = \|Y - X_U \hat\beta_U\|_2^2 = \|(I_{n \times n} - \Pi_U) Y\|_2^2 = \|\Pi_U^\perp Y\|_2^2$$

as claimed. Lastly, to establish equation (13), we note that

$$f(U) = \|\Pi_U^\perp (X_S \beta^*_S + W)\|_2^2 = \|\Pi_U^\perp (X_{S \setminus U} \beta^*_{S \setminus U} + W)\|_2^2,$$

since $\Pi_U^\perp v = 0$ for any vector v belonging to the range of $X_U$. Applying the same identity with
U = S gives $f(S) = \|\Pi_S^\perp W\|_2^2$, and taking the difference f(U) - f(S) yields the claimed form.



B Proof of Lemma 4

Note that $Z = \frac{1}{N^2} \sum_{U,V} Z_{U,V}$ is a rescaled sum of a total number $N^2$ variables (neither independent
nor identically distributed). However, since Z is a non-negative random variable, we may apply
Markov's inequality for any t > 0 to conclude that

$$\mathbb{P}[Z \ge t] \le \frac{\mathbb{E}[Z]}{t}. \qquad (32)$$

Since each $Z_{U,V}$ has distribution $\gamma(U, V)\, \chi^2_n$, we have $\mathbb{E}[Z_{U,V}] = \gamma(U, V)\, n$. From equation (29), we
note that $\gamma(U, V) \le 2 \mathcal{M}^2(\beta^*) s$, and hence

$$\mathbb{E}[Z] \le \max_{U,V} \gamma(U, V)\, n \le 2 \mathcal{M}^2(\beta^*)\, s\, n.$$

Hence setting $t = 4 \mathcal{M}^2(\beta^*)\, s\, n$ in the bound (32) yields the claim.
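The expectation used in this argument, namely that the squared distance between the two mean vectors has expectation 2 M^2 (s - |U ∩ V|) n, can be checked by simulation. The following sketch is a Monte Carlo illustration only; the function names, the particular subsets, and the values M = 0.5 and n = 10 are assumptions for the example.

```python
import random

def gamma(U, V, m2):
    """2 M^2 (s - |U intersect V|), the scale factor of the chi-square variate."""
    return 2.0 * m2 * (len(U) - len(set(U) & set(V)))

def mean_Z(U, V, m, n, trials, seed=0):
    """Monte Carlo estimate of E || X_U v - X_V v ||^2 with v = m * (1, ..., 1)."""
    rng = random.Random(seed)
    idx = sorted(set(U) ^ set(V))  # columns shared by U and V cancel in the difference
    sign = {j: (1.0 if j in U else -1.0) for j in idx}
    total = 0.0
    for _ in range(trials):
        for _ in range(n):  # the n coordinates are independent
            entry = m * sum(sign[j] * rng.gauss(0, 1) for j in idx)
            total += entry * entry
    return total / trials

U, V = (0, 1, 2), (2, 3, 4)
estimate = mean_Z(U, V, m=0.5, n=10, trials=2000)
exact = gamma(U, V, m2=0.25) * 10
print(estimate, exact)
```

Only the columns in the symmetric difference of U and V need to be sampled, which keeps the simulation independent of the ambient dimension p.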



C Bounds on binomial coefficients

Although more refined results are certainly possible, we make frequent use of the following crude
bounds on the binomial coefficients:

$$\left( \frac{n}{k} \right)^k \le \binom{n}{k} \le \left( \frac{n e}{k} \right)^k. \qquad (33)$$
D Tail bounds for chi-square variables

The following large-deviations bounds for central $\chi^2$ variates are taken from Laurent and Massart [17].
Given a central $\chi^2$ variate X with d degrees of freedom, then for all $x \ge 0$,

$$\mathbb{P}\big[ X - d \ge 2\sqrt{d x} + 2x \big] \le \exp(-x), \qquad \text{and} \qquad (34a)$$
$$\mathbb{P}\big[ X - d \le -2\sqrt{d x} \big] \le \exp(-x). \qquad (34b)$$

More generally, the analogous tail bounds for non-central $\chi^2$, taken from Birge pQ, can be
established via the Chernoff bound. Let X be a non-central $\chi^2$ variable with d degrees of freedom and
non-centrality parameter $\nu \ge 0$. Then for all $x \ge 0$,

$$\mathbb{P}\big[ X \ge (d + \nu) + 2\sqrt{(d + 2\nu) x} + 2x \big] \le \exp(-x), \qquad \text{and} \qquad (35a)$$
$$\mathbb{P}\big[ X \le (d + \nu) - 2\sqrt{(d + 2\nu) x} \big] \le \exp(-x). \qquad (35b)$$
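Tail bounds of this type can be sanity-checked by simulation. The sketch below estimates the upper-tail frequency of a central chi-square variate beyond d + 2 sqrt(d x) + 2x, and the lower-tail frequency of a non-central chi-square variate below (d + nu) - 2 sqrt((d + 2 nu) x), and compares each with exp(-x). The function names, the sample sizes, and the chosen mean vector are assumptions for the example.

```python
import math
import random

def central_upper_freq(d, x, trials, seed=0):
    """Empirical frequency of {X - d >= 2 sqrt(d x) + 2 x} for a central chi-square X."""
    rng = random.Random(seed)
    thresh = d + 2.0 * math.sqrt(d * x) + 2.0 * x
    hits = 0
    for _ in range(trials):
        X = sum(rng.gauss(0, 1) ** 2 for _ in range(d))
        hits += X >= thresh
    return hits / trials

def noncentral_lower_freq(mu, x, trials, seed=0):
    """Empirical frequency of {X <= (d + nu) - 2 sqrt((d + 2 nu) x)}, X = sum (Z_i + mu_i)^2."""
    rng = random.Random(seed)
    d = len(mu)
    nu = sum(m * m for m in mu)  # non-centrality parameter
    thresh = (d + nu) - 2.0 * math.sqrt((d + 2.0 * nu) * x)
    hits = 0
    for _ in range(trials):
        X = sum((rng.gauss(0, 1) + m) ** 2 for m in mu)
        hits += X <= thresh
    return hits / trials

x = 1.0
print(central_upper_freq(5, x, 5000), noncentral_lower_freq([2.0, 0, 0, 0, 0], x, 5000), math.exp(-x))
```

Both empirical frequencies should fall below exp(-x); the bounds are typically far from tight at small d, which is consistent with their role as crude large-deviations estimates.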